
fix(prometheus_scrape source): scrape endpoints in parallel #17660

Conversation

wjordan
Contributor

@wjordan wjordan commented Jun 9, 2023

Fixes #17659.

@wjordan wjordan requested a review from a team June 9, 2023 23:20
@netlify

netlify bot commented Jun 9, 2023

Deploy Preview for vector-project ready!

🔨 Latest commit: 5557211
🔍 Latest deploy log: https://app.netlify.com/sites/vector-project/deploys/6483b3cee5520000089afbf8
😎 Deploy Preview: https://deploy-preview-17660--vector-project.netlify.app

@netlify

netlify bot commented Jun 9, 2023

Deploy Preview for vrl-playground canceled.

🔨 Latest commit: 5557211
🔍 Latest deploy log: https://app.netlify.com/sites/vrl-playground/deploys/6483b3ce15b52800088c2daf

@github-actions github-actions bot added the domain: sources (Anything related to Vector's sources) label on Jun 9, 2023
@wjordan
Contributor Author

wjordan commented Jun 12, 2023

This still needs some additional work:

  1. (blocking issue) Pending requests can start to pile up on the heap, causing unbounded memory growth. Scrape requests should implement a timeout shorter than the interval duration to prevent this (see the sketch after this list).
    • Alternatively, skip future requests for the same endpoint if a previous scrape request still hasn't completed.
  2. Each request should spawn in a separate short-lived task to spread out the request-processing load across many threads.
    • Alternatively, each endpoint could spawn and reuse a single long-lived task, which could be more efficient.
  3. The timing of endpoint requests should be distributed across the scrape interval instead of all executing at the same time, to spread out the scrape-request load more evenly.
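
A minimal sketch of the timeout approach in item 1, assuming a hypothetical `scrape_one` helper for the HTTP request (this is not this PR's actual code, just the shape of the fix):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical helper: performs one HTTP scrape of `endpoint` and returns the body.
async fn scrape_one(endpoint: &str) -> Result<String, Box<dyn std::error::Error + Send + Sync>> {
    // ... issue the HTTP request and read the response body here ...
    Ok(format!("metrics from {endpoint}"))
}

/// Bound each scrape by a timeout shorter than the scrape interval so a slow
/// endpoint cannot accumulate pending requests across ticks.
async fn scrape_with_timeout(endpoint: &str, scrape_interval: Duration) -> Option<String> {
    // Leave some headroom below the interval (here: 90% of it).
    let budget = scrape_interval.mul_f64(0.9);
    match timeout(budget, scrape_one(endpoint)).await {
        Ok(Ok(body)) => Some(body),
        Ok(Err(err)) => {
            eprintln!("scrape of {endpoint} failed: {err}");
            None
        }
        Err(_elapsed) => {
            eprintln!("scrape of {endpoint} timed out after {budget:?}");
            None
        }
    }
}
```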

@jszwedko
Member

jszwedko commented Jun 12, 2023

Thanks for opening this!

  • (blocking issue) Pending requests can start to pile up on the heap, causing unbounded memory growth. Scrape requests should implement a timeout shorter than the interval duration to prevent this.
    • Alternatively, skip future requests for the same endpoint if a previous scrape request still hasn't completed.

This seems like a separate issue, no? Or is that behavior introduced by this PR?

  • Each request should spawn in a separate short-lived task to spread out the request-processing load across many threads.
    • Alternatively, each endpoint could spawn and reuse a single long-lived task, which could be more efficient.

I'm not sure I follow this one. The normal pattern would be to just execute the tasks async and let tokio manage scheduling. I think that's what is happening here?

  • The timing of endpoint requests should be distributed across the scrape interval instead of all executing at the same time, to spread out the scrape-request load more evenly.

This feels like a nice-to-have. I wouldn't consider it blocking this PR.
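
For what it's worth, one way to spread the requests out would be to give each endpoint's loop a start offset within the interval. A rough sketch (the function and names are illustrative, not this PR's code):

```rust
use std::time::Duration;
use tokio::time::{interval_at, Instant, MissedTickBehavior};

/// Start each endpoint's scrape loop at a different offset within the scrape
/// interval so the requests are spread out instead of all firing on one tick.
fn spawn_staggered(endpoints: Vec<String>, scrape_interval: Duration) {
    let n = endpoints.len().max(1) as u32;
    for (i, endpoint) in endpoints.into_iter().enumerate() {
        // Evenly distribute the start offsets across one interval.
        let offset = scrape_interval * (i as u32) / n;
        tokio::spawn(async move {
            let mut ticker = interval_at(Instant::now() + offset, scrape_interval);
            ticker.set_missed_tick_behavior(MissedTickBehavior::Delay);
            loop {
                ticker.tick().await;
                // ... perform the scrape of `endpoint` here ...
                println!("scraping {endpoint}");
            }
        });
    }
}
```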

@jszwedko jszwedko requested a review from StephenWakely June 26, 2023 15:25
@StephenWakely
Contributor

This seems like a separate issue, no? Or is that behavior introduced by this PR?

So before, if you have a slow Prometheus endpoint, the single request will wait until it is finished. This does mean that if, say, you were scraping every 15 seconds, you may actually only get metrics every 30 seconds. Technically there is some data loss. Memory usage remains constant.

With this PR, the call to scrape occurs every 15 seconds regardless of how slow the endpoint is. The result is that if the scrape takes longer than 15 seconds, the request is placed in a queue. This queue can get longer and longer, and memory usage will creep up. Presumably Vector will eventually be killed.

I had to push things fairly hard to get a significant increase in memory. 10 endpoints scraped every second with a 20-second lag time for each request was using about 5 GB after 30 minutes or so. So I'm not sure how much of a problem this would be in the real world.

Ideally we should implement the workarounds suggested by @wjordan. We just need to decide whether we should do that before merging this or after.
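
As a rough illustration of the skip-if-still-running workaround (a sketch with hypothetical names, not what either PR implements): a per-endpoint in-flight flag lets each tick either start a scrape or skip, so at most one request per endpoint is ever pending and the queue cannot grow.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};
use std::time::Duration;

/// Called on every scrape tick for one endpoint. If the previous scrape of
/// that endpoint is still in flight, the tick is skipped, so at most one
/// request per endpoint is pending at any time and memory stays bounded.
fn scrape_if_idle(endpoint: String, in_flight: Arc<AtomicBool>) {
    // Try to claim the in-flight slot; if it is already taken, skip this tick.
    if in_flight
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .is_err()
    {
        println!("previous scrape of {endpoint} still running, skipping tick");
        return;
    }
    tokio::spawn(async move {
        // ... perform the HTTP scrape of `endpoint` here; a slow endpoint only delays itself ...
        tokio::time::sleep(Duration::from_millis(10)).await; // placeholder for the request
        println!("finished scraping {endpoint}");
        in_flight.store(false, Ordering::Release);
    });
}
```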

@spencergilbert
Contributor

Ideally we should implement the workarounds suggested by @wjordan. We just need to decide whether we should do that before merging this or after.

I'd say before merging; I doubt we'd find the time to prioritize that unless/until someone reports it as a bug, which isn't a great user experience.

@nullren
Contributor

nullren commented Jul 19, 2023

@wjordan I couldn't push to your branch, but I added timeouts here: #18021. I'll make the timeouts configurable tomorrow 👍

One thing I noticed was that with lots of endpoints there was a bit of "set up" time where all the futures get created first and then run. So, for example, with around 20 endpoints it took about 7 seconds after the first "tick" started for the actual requests to go out. Not particularly problematic, but it was unexpected and not something I fully understand 🤔

Edit: fixed ☝️ in my PR

@wjordan
Contributor Author

wjordan commented Jul 20, 2023

@nullren has been making further progress on this in #18021, so I'm closing this PR in favor of that one and adding some follow-up to the previous discussion here:

  • Each request should spawn in a separate short-lived task to spread out the request-processing load across many threads.
    • Alternatively, each endpoint could spawn and reuse a single long-lived task, which could be more efficient.

I'm not sure I follow this one. The normal pattern would be to just execute the tasks async and let tokio manage scheduling. I think that's what is happening here?

My Rust / Tokio knowledge is limited, so I could be completely mistaken on this one. As I understand it, the current setup runs the scrapes async but not in separate tasks, so tokio does the network I/O in parallel but can't run the request-processing load (HTTP parsing, event enrichment, etc.) across many threads unless each client-request future is also wrapped in tokio::spawn. It's possible the task-spawning and context-switching overhead is much greater than the actual request-processing load anyway, and it may need profiling to know for sure; in any case it's a small optimization detail at best and probably not terribly important.
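
To make the distinction concrete, here is a sketch (not this PR's code; `scrape_and_parse` is a hypothetical stand-in): joining the futures keeps all of the response parsing on the single task that polls them, while spawning each scrape lets the runtime schedule that work across its worker threads.

```rust
use tokio::task::JoinSet;

// Hypothetical stand-in for one HTTP scrape plus response parsing.
async fn scrape_and_parse(endpoint: String) -> usize {
    // ... request, body parsing, event enrichment ...
    endpoint.len()
}

async fn concurrent_on_one_task(endpoints: Vec<String>) {
    // The futures run concurrently, but they are all polled by this one task,
    // so their CPU-bound parsing runs on one worker thread at a time.
    let results = futures::future::join_all(endpoints.into_iter().map(scrape_and_parse)).await;
    println!("parsed {} responses on a single task", results.len());
}

async fn spawned_per_endpoint(endpoints: Vec<String>) {
    // Each scrape is its own task, so the runtime can run the parsing of
    // different endpoints on different worker threads in parallel.
    let mut set = JoinSet::new();
    for endpoint in endpoints {
        set.spawn(scrape_and_parse(endpoint));
    }
    while let Some(result) = set.join_next().await {
        let _parsed = result.expect("scrape task panicked");
    }
}
```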

I had to push things fairly hard to get a significant increase in memory. 10 endpoints scraped every second with a 20 second lag time for each request was using about 5gb after 30 minutes or so. So I'm not sure how much of a problem this would be in the real world.

For reference, the real-world use cases both @nullren and I are working with involve thousands of endpoints; I had to kill Vector within seconds in my local testing.

@wjordan
Contributor Author

wjordan commented Jul 20, 2023

Closing in favor of #18021

@wjordan wjordan closed this Jul 20, 2023
github-merge-queue bot pushed a commit that referenced this pull request Jul 24, 2023
…timeouts (#18021)

fixes #14087 
fixes #14132 
fixes #17659

- [x] make target timeout configurable

this builds on what @wjordan did in
#17660

### what's changed
- prometheus scrapes happen concurrently
- requests to targets can time out
- the timeout can be configured (user-facing change)
- small change in how the HTTP client was instantiated

---------

Co-authored-by: Doug Smith <dsmith3197@users.noreply.github.com>
Co-authored-by: Stephen Wakely <stephen@lisp.space>
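
For reference, the user-facing side of "the timeout can be configured" would look roughly like the following; the struct and field names here are illustrative assumptions, not necessarily what #18021 shipped:

```rust
use std::time::Duration;
use serde::Deserialize;

/// Illustrative config fragment for a per-source scrape timeout.
#[derive(Debug, Deserialize)]
struct PrometheusScrapeConfig {
    /// The endpoints to scrape.
    endpoints: Vec<String>,
    /// Maximum time, in seconds, to wait for any single scrape request.
    #[serde(default = "default_scrape_timeout_secs")]
    scrape_timeout_secs: u64,
}

fn default_scrape_timeout_secs() -> u64 {
    5
}

impl PrometheusScrapeConfig {
    fn scrape_timeout(&self) -> Duration {
        Duration::from_secs(self.scrape_timeout_secs)
    }
}
```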